Goals

I’m going to try working through this lab as I also try to update it. The original lab mostly is good, but it teaches some less than useful or appropriate skills. So I’m going to replace some components with some ideas of my own. As I do this, I want to make sure everything works.

Getting started

First, I need to read in the data from my working directory.

Here are the packages I need:

library(tidyverse)
library(modelr)

And here’s the data:

dt <- read_csv("Columbus_2020_listings.csv")

There are a bunch of variables in here:

dim(dt)
## [1] 1409  106

Okay, time for some brain work. I’m telling students to narrow down to 5-7 variables that they think will be most relevant for predicting housing price. I should probably do the same. There’s a lot in here, and many more than 7 variables are probably predictive. Here are some promising ideas:

Host details

Neighborhood details

Property details

Pricing

Satisfaction

Misc

Okay, so I have my picks. Now I need to do some cleaning and refining:

dt %>%
  transmute(
    ## the outcome(s)
    across(c(price, 
             security_deposit, 
             cleaning_fee),
           ~ str_remove(.x, "\\$") %>%
             as.numeric()),
    
    ## the predictors
    host_acceptance_rate = 
      str_remove(host_acceptance_rate, "\\%") %>%
      as.numeric(),
    host_is_superhost = host_is_superhost+0,
    neighbourhood_cleansed,
    room_type,
    number_of_reviews,
    review_scores_rating,
    cancellation_policy,
    
    ## keep location details for mapping
    longitude,
    latitude
  ) -> dtclean
glimpse(dtclean)
## Rows: 1,409
## Columns: 12
## $ price                  <dbl> 198, 47, 80, 90, 30, 60, 75, 199, 235, 75, 32, …
## $ security_deposit       <dbl> 350, 0, 0, 125, 0, 0, 0, 0, 200, 0, 0, NA, NA, …
## $ cleaning_fee           <dbl> 70, 20, 25, 75, 0, 0, 5, 125, 75, 5, 0, NA, NA,…
## $ host_acceptance_rate   <dbl> 96, 100, 100, 87, 98, 98, 100, 100, 54, 100, 98…
## $ host_is_superhost      <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ neighbourhood_cleansed <chr> "Near North/University", "Near North/University…
## $ room_type              <chr> "Entire home/apt", "Private room", "Private roo…
## $ number_of_reviews      <dbl> 378, 97, 220, 131, 218, 44, 165, 131, 64, 131, …
## $ review_scores_rating   <dbl> 96, 93, 98, 98, 95, 94, 99, 100, 97, 99, 97, 90…
## $ cancellation_policy    <chr> "strict_14_with_grace_period", "moderate", "mod…
## $ longitude              <dbl> -83.00321, -83.00986, -82.97968, -83.00067, -83…
## $ latitude               <dbl> 39.98394, 40.01243, 39.96086, 40.01898, 40.0125…

The room type and cancellation policy columns may make more sense as ordered categories:

## how many unique categories
table(dtclean$room_type) # 4
## 
## Entire home/apt      Hotel room    Private room     Shared room 
##            1043               3             346              17
table(dtclean$cancellation_policy) # 4 (but really 3)
## 
##                    flexible                    moderate 
##                         433                         508 
## strict_14_with_grace_period             super_strict_30 
##                         467                           1
## okay let's update the order
library(socsci)
dtclean %>%
  mutate(
    room_type = frcode(
      room_type == "Shared room" ~ "Shared room",
      room_type %in% 
        c("Hotel room", "Private room") ~ "Hotel/Private room",
      room_type == "Entire home/apt" ~ "Entire home/apt"
    ),
    cancellation_policy = frcode(
      cancellation_policy == "flexible" ~ "Flexible",
      cancellation_policy == "moderate" ~ "Moderate",
      TRUE ~ "Strict/Super strict"
    )
  ) -> dtclean

Polished Visualizations

I need to make two polished visuals showing the relationship between some predictors and price.

Visual 1:

coolorrr::set_theme()
dtclean %>%
  group_by(room_type) %>%
  mean_ci(log(price)) %>%
  ggplot() +
  aes(x = exp(mean),
      xmin = exp(lower),
      xmax = exp(upper),
      y = room_type,
      label = paste0("N = ", 
                     scales::comma(n))) +
  geom_pointrange(
    aes(size = n),
    show.legend = F
  ) +
  scale_size_continuous(
    range = c(0.1, 1)
  ) +
  geom_text(vjust = 1.75) +
  scale_x_continuous(
    labels = scales::dollar
  ) +
  labs(
    x = "Avg. Housing Price",
    y = NULL,
    title = "AirBnB Price by Kind of Housing"
  )

Visual 2:

ggplot(dtclean %>%
         filter(host_acceptance_rate>0)) +
  aes(x = host_acceptance_rate,
      y = price) +
  geom_jitter(
    aes(y = 40),
    color = "darkgray",
    alpha = 0.25,
    height = 5,
    width = 1
  ) +
  scale_y_continuous(
    breaks = seq(40, 150, by = 10),
    labels = scales::dollar
  ) +
  stat_smooth(
    method = "glm",
    method.args = list(family = poisson),
    formula = y ~ x,
    color = "darkblue"
  ) +
  labs(
    x = "Host Acceptance Rate (%)",
    y = "Price of Housing",
    title = "Higher acceptance = higher price",
    subtitle = "Poisson line of best fit"
  )

A map

Now I need to make a map of Columbus to summarize housing locations. First I need to install {ggmap} by running devtools::install_github("dkahle/ggmap"). Then I can open it:

library(ggmap)

Now I need the coordinates for Columbus:

## coordinates
map <- c(left=-83.2, bottom=39.8, right=-82.75, top=40.16)

## state map
cbus_map <- get_stamenmap(map, zoom = 10, maptype = "toner")

Now let’s check it out:

ggmap(cbus_map) +
  theme_void()

That’s cool. So I should be able to add the coordinates of different AirBnBs by writing:

ggmap(cbus_map) +
  geom_point(
    data = dtclean,
    aes(x = longitude,
        y = latitude,
        size = price),
    alpha = 0.05,
    color = "darkblue",
    show.legend = F
  ) +
  labs(title = "Location of AirBnBs in Columbus",
       subtitle = "Larger points indicate higher prices") +
  theme(
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

It looks like it’s expensive to stay close to downtown. It may be worth adding a column that’s distance to the city center:

ggmap(cbus_map)

library(geosphere)
## lon = -83 : lat = 39.96
ccenter <- c(-83, 39.96)
dtclean$distance <- with(dtclean, distHaversine(ccenter, cbind(longitude, latitude))) / 1000
ggplot(dtclean) +
  aes(
    x = distance,
    y = price
  ) +
  geom_point() +
  scale_y_continuous(
    labels = scales::dollar
  ) +
  #scale_x_log10() +
  labs(
    x = "Distance from city center (km)",
    y = "Price",
    title = "It pays to have a location downtown"
  )

Bivariate regression

We can make a simple bivariate regression modeling the log of prices as a function of the host acceptance rate:

simple_fit <- lm(price ~ distance, dtclean)
summary(simple_fit)
## 
## Call:
## lm(formula = price ~ distance, data = dtclean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.22  -78.76  -45.29   14.68  866.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 176.4357     6.3730  27.685   <2e-16 ***
## distance     -8.1047     0.9453  -8.574   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 155.4 on 1369 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.05096,    Adjusted R-squared:  0.05027 
## F-statistic: 73.52 on 1 and 1369 DF,  p-value: < 2.2e-16

Check for multicolinearity

library(GGally)
ggpairs(dtclean %>% select(security_deposit,
                           host_acceptance_rate,
                           host_is_superhost,
                           distance)) +
  theme_classic()

Use cross-validation to pick the best model

library(modelr)
## How about we just compare 4 models?

## Step 1: use MC cross-validation
cv_dt <- dtclean %>%
  crossv_mc(n = 500)

## Step 2: specify a list of model specifications
spec_list <- list(
  log(price) ~ 1, # NULL model for reference
  log(price) ~ room_type,
  log(price) ~ room_type + distance,
  log(price) ~ room_type + host_is_superhost,
  log(price) ~ room_type + host_acceptance_rate
)

## Step 3: train
cv_dt %>%
  expand_grid(., specs = spec_list) %>%
  mutate(
    model_num = rep(1:5, len = n()),
    models = map2(specs, train, ~ lm(.x, .y))
  ) -> cv_out

## Step 4: validate performance
cv_out %>%
  mutate(
    rmse = map2(models, test, ~ rmse(.x, .y))
  ) -> cv_out

## Step 5: collect the columns I need
cv_out %>%
  select(.id, model_num, rmse) %>%
  unnest -> cv_results

## Step 6: visualize the results
cv_results %>%
  group_by(model_num) %>%
  mutate(
    median = mean(rmse)
  ) %>%
  ggplot() +
  aes(x = model_num,
      y = rmse) +
  geom_jitter(
    color = "darkgray",
    width = .1,
    alpha = 0.4
  ) +
  geom_line(
    aes(y = median)
  ) +
  geom_point(
    aes(y = median),
    color = "red",
    size = 2
  ) +
  scale_x_continuous(
    labels = c("NULL",
               "type",
               "type + distance",
               "type + superhost",
               "type + acceptance")
  ) +
  labs(x = NULL,
       y = "RMSE",
       title = "MC Cross-validation Results",
       caption = "(500 MC iterations)")

Seems like using the housing type and distance to downtown do the best!

Make predictions

So we know this is the best:

modelfit <- lm(log(price) ~ room_type + distance, dtclean)
summary(modelfit)
## 
## Call:
## lm(formula = log(price) ~ room_type + distance, data = dtclean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3948 -0.3906 -0.1062  0.3129  2.3008 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.489685   0.155789  22.400  < 2e-16 ***
## room_typeHotel/Private room  0.580033   0.157774   3.676 0.000246 ***
## room_typeEntire home/apt     1.415828   0.155372   9.112  < 2e-16 ***
## distance                    -0.022971   0.004055  -5.666 1.78e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6347 on 1367 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.3127, Adjusted R-squared:  0.3112 
## F-statistic: 207.3 on 3 and 1367 DF,  p-value: < 2.2e-16

Low/med/high price predictions:

predata <- tibble(
  room_type = sort(unique(dtclean$room_type)),
  distance = c(20, 10, 0)
)
preds <- paste0("$",round(exp(predict(modelfit, predata)), 2))
cbind(price = preds, predata)
##     price          room_type distance
## 1   $20.7        Shared room       20
## 2  $46.53 Hotel/Private room       10
## 3 $135.03    Entire home/apt        0

You would net the highest nightly rate with an AirBnB that’s an entire home/apartment close to the city center. If you only can offer a shared room in a far-out suburb, good luck making much money!